Hedera: Scalable Indexing and Exploring Entities in Wikipedia Revision History

Authors

  • Tuan A. Tran
  • Tu Ngoc Nguyen
Abstract

Much of the work in the Semantic Web that relies on Wikipedia as its main source of knowledge operates on static snapshots of the dataset. The full history of Wikipedia revisions, while containing much more useful information, is still difficult to access due to its exceptional volume. To enable further research on this collection, we developed a tool named Hedera that efficiently extracts semantic information from Wikipedia revision history datasets. Hedera exploits the MapReduce paradigm to achieve rapid extraction; it is able to process the entire revision history of Wikipedia articles within a day on a medium-scale cluster, and it supports flexible data structures for various kinds of Semantic Web study.
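The abstract does not show the tool itself; as a rough sketch of the MapReduce-style processing it describes, the Hadoop job below counts revisions per article from a dump preprocessed into one tab-separated line per revision. The input layout and class names here are assumptions for illustration, not Hedera's actual API.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: count revisions per article title, assuming the dump has been
// preprocessed into lines of "<title>\t<revisionId>\t<timestamp>".
public class RevisionCount {

  public static class RevisionMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text title = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      if (fields.length < 3) return;   // skip malformed records
      title.set(fields[0]);
      ctx.write(title, ONE);           // emit one count per revision
    }
  }

  public static class SumReducer
      extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text title, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable c : counts) sum += c.get();
      ctx.write(title, new LongWritable(sum)); // total revisions per article
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "revision-count");
    job.setJarByClass(RevisionCount.class);
    job.setMapperClass(RevisionMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because map tasks run independently over input splits, a job of this shape scales with the cluster size, which is what makes processing the full revision history feasible within a day on a medium-scale cluster.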


Similar resources

TokTrack: A Complete Token Provenance and Change Tracking Dataset for the English Wikipedia

We present a dataset that contains every instance of all tokens (≈ words) ever written in undeleted, non-redirect English Wikipedia articles until October 2016, in total 13,545,349,787 instances. Each token is annotated with (i) the article revision in which it was originally created, and (ii) lists of all the revisions in which the token was ever deleted and (potentially) re-added and re-deleted...
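The annotation scheme sketched in this blurb can be pictured as a per-token record; the Java shape below is an illustration of that structure, with field names that are assumptions rather than TokTrack's actual schema.

```java
import java.util.List;

// Illustrative, assumed shape of a TokTrack-style token annotation; not the
// dataset's actual schema.
public record TokenProvenance(
    String token,            // the token (≈ word) itself
    long originRevisionId,   // (i) revision in which the token was first written
    List<Long> deletedIn,    // (ii) revisions in which it was deleted
    List<Long> reAddedIn     // revisions in which it was later re-added
) {}
```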


How to Trace and Revise Identities

The Entity Name System (ENS) is a service that aims to provide globally unique URIs for all kinds of real-world entities such as persons, locations, and products, based on descriptions of such entities. Because the entity descriptions available to the ENS for deciding on entity identity (do two entity descriptions refer to the same real-world entity?) change over time, the system has to revise it...


Wikipedia Revision Toolkit: Efficiently Accessing Wikipedia's Edit History

We present an open-source toolkit which allows (i) reconstructing past states of Wikipedia and (ii) efficiently accessing the edit history of Wikipedia articles. Reconstructing past states of Wikipedia is a prerequisite for reproducing previous experimental work based on Wikipedia. Beyond that, the edit history of Wikipedia articles has been shown to be a valuable knowledge source for NLP, but...


Extracting Wikipedia Historical Attributes Data

In this paper, we describe the collection of a large structured dataset of temporally anchored relational data, obtained from the full revision history of the English Wikipedia. By mining (attribute, value) pairs from this revision history, we are able to collect a comprehensive, temporally-aware knowledge base that contains data on how attributes change over time. We discuss different characte...
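To make the idea of mining attribute changes concrete, here is a minimal, assumed sketch (not the paper's actual pipeline): given attribute maps extracted from two consecutive revisions, it emits the attributes whose values changed, which is the raw material for a temporally anchored knowledge base. The example values in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of detecting (attribute, value) changes between two consecutive
// revisions; a simplified, assumed view of such a mining pipeline.
public class AttributeDiff {
  public static List<String> changes(Map<String, String> prev, Map<String, String> curr) {
    List<String> out = new ArrayList<>();
    for (Map.Entry<String, String> e : curr.entrySet()) {
      String before = prev.get(e.getKey()); // null if newly introduced
      if (!e.getValue().equals(before)) {
        out.add(e.getKey() + ": " + before + " -> " + e.getValue());
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // Hypothetical infobox values in two successive revisions.
    List<String> diff = changes(
        Map.of("population", "8,336,817"),
        Map.of("population", "8,804,190"));
    diff.forEach(System.out::println); // population: 8,336,817 -> 8,804,190
  }
}
```

Pairing each emitted change with the timestamp of the revision that introduced it is what makes the resulting pairs temporally anchored.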


Modeling Events in Time using Cascades of Poisson Processes

Modeling Events in Time using Cascades of Poisson Processes, by Aleksandr Simma. Doctor of Philosophy in Computer Science with a Designated Emphasis in Communication, Computation, and Statistics, University of California, Berkeley; Professor Michael I. Jordan, Chair. For many applications, the data of interest can be best thought of as events – entities that occur at a particular moment in time, h...
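For context, one standard way to formalize a cascade of Poisson processes (notation mine, not necessarily the thesis's) is a counting process whose conditional intensity adds a baseline rate to kernels triggered by earlier events:

```latex
% Conditional intensity of a cascaded (self-exciting) Poisson process:
% baseline rate \mu(t) plus a triggering kernel g applied to each past event t_i.
\lambda(t) = \mu(t) + \sum_{t_i < t} g(t - t_i)
```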



Publication date: 2014